Now that the main components of the architecture have been decided, choices must be made on the overall design.
User dashboards may be very costly in a website. Each user has an opportunity to design his own personal website, it comes with the cost of storing all that information. Efforts have been made (and are still being made) to reduce this cost, but there will always be an overhead.
This overhead might be hard to estimate as this will depend a lot on how the users will navigate through the website. Maybe only a minority will use this functionality, or maybe the website is only made of dashboards. In any case the impact of making this feature available must be measured by:
estimating the number of dashboards and pages that will be created
observing the impact on the database (through JCR) in terms of size
The JCR implementation uses Apache Lucene for indexing the data. The indexes are used to search for content (It can be page nodes or WCM content for instance).
Lucene isn't cluster-ready, but on a cluster, each node will need to be able to search for content and will need to have access to the lucene indexes.
When is comes to searching, there is always a tradeoff. Everyone would want to achieve all of the following:
Fast to search
Fast to index
Same search result on each node at the same time
No need to rebuild the index ever (No inconsistency)
No impact on overall performance
Easy to setup (no infrastructure change)
But there are choices to be made. The JCR implementation used by EPP (eXo JCR) makes it possible to configure the storage and retrieval of indexes according to architect's choice on where it is acceptable to relax some constraints. For configuration details please refer to the EPP Reference Guide.
This is only for a non-cluster environment, this is obviously the easiest setup, with a combination of in-memory and file based indexes. There is no replication involved so any entry can be found by a search as soon as it is created.
This environment is easy to set up, each node keeps a local copy of the full indexes so that when a search is requested on a node, there is no network communication required. The downside is that when a node indexes an item, it is required to replicate that index on each and every node. If a node is unavailable at that time, it may miss an index update request and then the different nodes may be inconsistent.
Also when a node is added, it has to recreate it's own full index.
An alternative to this setup is to ask a node to retrieve the info from a coordinator on each search, which makes the startup of the new node faster, but impacts its performance during the runtime. This setup is new to EPP 5.2.